Project - Twitter US Airline Sentiment


Background & Context:

Objective:

Data Description:

Twitter data was scraped during February 2015. The dataset comprises the following information:


Table of Contents (TOC)

- Importing Packages
- Unwrapping Customer Information
- Data Pre-Processing & Sanity Checks
- Summary of Data Analysis
- EDA Analysis
- Content Preprocessing
- Model Building
- Summary of the Modeling
- Recommendations

Importing required Packages:



Unwrapping the Customer Information:



Data Information:

Based on a high-level look at the data values, the features in this dataset are understood as follows:


Data Description:

Data Preprocessing & Sanity Checks



Dropping the Tweet ID column

Checking for Duplicates
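A minimal sketch of the duplicate check, assuming the tweets are held in a pandas DataFrame. The small frame and its values below are made up for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the tweets DataFrame (illustrative values only).
df = pd.DataFrame({
    "airline": ["United", "Delta", "United", "United"],
    "text": ["late again", "thanks!", "late again", "lost my bag"],
})

# Count rows that exactly repeat an earlier row across all columns.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats.
df_unique = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(df_unique))
```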

Inferences:

Checking for Uniqueness

Inferences:

Extracting the Date, Month & Year from the timestamp
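A minimal sketch of the timestamp split using only the standard library. The timestamp layout is an assumption (it is not stated in this notebook text), and `split_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed layout, not confirmed by the text: "YYYY-MM-DD HH:MM:SS +/-HHMM".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"

def split_timestamp(raw: str) -> dict:
    """Parse one timestamp string and return its calendar parts."""
    parsed = datetime.strptime(raw, TIMESTAMP_FORMAT)
    return {
        "date": parsed.date(),
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
        "hour": parsed.hour,
    }

# Made-up example value.
parts = split_timestamp("2015-02-22 09:15:00 -0800")
print(parts)
```

The `hour` part is included because the later hour-of-day analysis would need it.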

Missing Value analysis
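A minimal sketch of a per-column missing-value summary, assuming a pandas DataFrame. The column names and values are illustrative assumptions, not taken from this notebook.

```python
import pandas as pd

# Hypothetical frame; column names are assumed for illustration.
df = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "neutral"],
    "negativereason": ["Late Flight", None, None],
    "tweet_coord": [None, None, None],
})

# Count missing entries per column and express them as a percentage of rows.
missing = df.isna().sum()
missing_pct = (missing / len(df) * 100).round(1)

# Keep only the columns that actually have gaps, worst first.
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```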

Inferences:

Based on the above plot, 7 columns have missing values.

Observing the values of the columns for patterns and data correctness

Summary of Data Analysis


Data Structure:

Data Cleaning:

Data Description:

Data Observations:

Based on the data information:


Common Functions

EDA Analysis - Analyzing respective attributes to understand the data pattern



Analyzing the count and percentage of Categorical attributes using a bar chart

Insights from Categorical Data


Observations:

Analyzing the Numerical attributes using Histogram and Box Plots

Insights from Numerical Data


Observations:

EDA - Analysis based on respective Features



Distribution of Sentiment of Tweets

Inferences:

- 62.7% of the tweets are negative, followed by 21% neutral tweets.
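Percentage shares like the ones above can be computed with `value_counts(normalize=True)`. This is a sketch on a made-up Series, so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Made-up sentiment labels for illustration only.
sentiments = pd.Series(["negative"] * 5 + ["neutral"] * 3 + ["positive"] * 2)

# Share of each class as a percentage of all rows, rounded to one decimal.
share = (sentiments.value_counts(normalize=True) * 100).round(1)
print(share)
```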

Reasons for Negative tweets

Inferences:

Gold Reasons for Negative tweets

Inferences:

No. of Characters in Tweet

Inferences:

- Positive tweets have far fewer characters than negative tweets.
- There is no substantial difference between positive and neutral tweets in the number of characters used. Negative tweets have more characters, which is expected since users write more when reporting their concerns.

No. of Words in Tweet

Inferences:

- Positive tweets have far fewer words than negative tweets.
- There is no substantial difference between positive and neutral tweets in the number of words used. Negative tweets have more words, which is expected since users write more when reporting their concerns.
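A minimal sketch of how character and word counts per tweet can be derived and averaged per sentiment class. The frame, its column names, and its texts are illustrative assumptions.

```python
import pandas as pd

# Hypothetical tweets; column names are assumed for illustration.
df = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "positive", "neutral"],
    "text": [
        "flight delayed two hours no updates",
        "lost my bag again",
        "thanks crew",
        "what time is boarding",
    ],
})

# Length in characters, and number of whitespace-separated words.
df["n_chars"] = df["text"].str.len()
df["n_words"] = df["text"].str.split().str.len()

# Average length per sentiment class.
means = df.groupby("airline_sentiment")[["n_chars", "n_words"]].mean()
print(means)
```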

Distribution of words for each class

Inferences:

- As seen above, positive and neutral tweets have almost the same number of words, while negative tweets are longer. The distribution in the plot is close to normal.

Tweet distribution by date

Inferences:

- There are more tweets on the 22nd and 23rd of Feb. A lot of activity in the airline industry probably caused the peak in tweets.
- For the remaining days, the volume of tweets seems to be uniform.

Most active hour on twitter

Inferences:

- Most tweets are posted between 7:00 and 20:00, with the highest volume between 9:00 and 11:00.
- Fewer tweets are posted at night, after 20:00 and past midnight.

Timezone of Tweets (Top 10)

Inferences:

- The US EST timezone has the highest number of user tweets, followed by the US CST timezone.

Location of Tweets (Top 10)

Inferences:

- Boston has the highest number of user tweets, followed by New York, NY.

EDA - Analysis based on Airlines



Distribution of Tweets across Airlines

Inferences:

- Data for 6 airlines are considered for this classification.
- "United" received the most tweets, followed by "US Airways" and "American".
- 26% of the tweets are for United.

Distribution of Sentiment of Tweets across Airlines

Inferences:

Almost all airlines received mostly negative comments. It looks like users tend to tweet more when they want to convey a negative message about concerns or issues they faced.

Analysing the sentiments based on the respective airlines:


Reasons for Negative tweets across Airlines

Inferences:

Almost all airlines have "Customer Service Issue" as the top reason, except Delta, which has "Late Flights" as the major concern.

Analysing the negative reasons based on the respective airlines:


Most retweeted Tweet across Airlines

Inferences:

Analysing the most retweeted tweets based on the respective airlines:


Analysis of the Retweets based on Positive Sentiment across Airlines

Inferences:

Analysing the retweets based on the respective airlines:

Analysis of the Retweets based on Negative Sentiment across Airlines

Inferences:

Analysing the retweets based on the respective airlines:

Most common words in the positive, negative & neutral Sentiment tweets - Before Data Preprocessing


Common words in Positive Sentiment tweet

Common words in Negative Sentiment tweet

Common words in Neutral Sentiment tweet

Inferences:

Storing the original text for later reference

Dropping the Feature columns with too many missing values or with irrelevant information

Inferences:


Data Preprocessing



The following preprocessing steps will be performed to prepare the data for sentiment analysis:

Removal of HTML tags

Removal of Emojis

Replace Contractions
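A minimal sketch of contraction replacement with a lookup table and one regular expression. The mapping is a small illustrative sample (an assumption), and `expand_contractions` is a hypothetical helper name. It handles straight apostrophes only and returns the replacement in lowercase.

```python
import re

# Small illustrative mapping; a real run would need a much fuller list.
CONTRACTIONS = {
    "i'm": "i am",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# One alternation pattern covering every key, matched case-insensitively.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded (lowercase) form."""
    return PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)

expanded = expand_contractions("I'm stuck and can't rebook")
print(expanded)
```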

Removal of Numbers

Removal of URLs

Removal of Mention in tweets
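The pattern-based removals above (HTML tags, emojis, numbers, URLs, mentions) can be sketched with the standard library alone. The regular expressions are simplified assumptions, the emoji ranges in particular are rough rather than exhaustive, and `strip_noise` is a hypothetical helper name.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                # HTML tags such as <b> or </b>
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) links and bare www links
MENTION_RE = re.compile(r"@\w+")               # @username mentions
NUMBER_RE = re.compile(r"\d+")                 # runs of digits
# Rough emoji and symbol ranges; an assumption, not a complete list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_noise(text: str) -> str:
    """Remove markup, links, mentions, emojis and digits, then tidy spaces."""
    text = html.unescape(text)        # turn entities like &amp; into characters
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)      # before digits, since URLs contain them
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    return " ".join(text.split())     # collapse repeated whitespace

cleaned = strip_noise(
    "@united <b>2</b> hours late \U0001F621 see http://t.co/abc123 &amp; fix it"
)
print(cleaned)
```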

Tokenization of Data

Installing Stopwords

Removal of Non-ASCII Characters

Lowercase conversion

Removal of Punctuation
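A minimal sketch of the token-level normalization steps (non-ASCII removal, lowercasing, punctuation removal). A plain whitespace split stands in for the NLTK tokenizer here, which is an assumption made to keep the example dependency-free. The helper names are hypothetical.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def to_ascii(token: str) -> str:
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def normalize_tokens(text: str) -> list:
    """Tokenize on whitespace, then clean each token; empty tokens are dropped."""
    cleaned = []
    for token in text.split():
        token = to_ascii(token).lower().translate(PUNCT_TABLE)
        if token:
            cleaned.append(token)
    return cleaned

tokens = normalize_tokens("Café was CLOSED!!! Flight delayed, again?!")
print(tokens)
```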

Removal of Stop Words
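A minimal sketch of stop-word filtering on an already tokenized tweet. The tiny stop-word set is an illustrative assumption standing in for the NLTK English list, and it leaves "not" out on purpose, since negations can carry sentiment.

```python
# Tiny illustrative stop-word set; stands in for the NLTK English list.
STOP_WORDS = {"the", "a", "an", "is", "was", "to", "and", "my", "for", "of", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

kept = remove_stop_words(["the", "flight", "was", "not", "on", "time"])
print(kept)
```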

Lemmatize Words


Most common words in the positive, negative & neutral Sentiment tweets - Post Data Preprocessing


The most Common words in content column
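A minimal sketch of counting the most common words across tokenized tweets with `collections.Counter`. The token lists are made up, and `top_words` is a hypothetical helper name.

```python
from collections import Counter

def top_words(token_lists, n=3):
    """Count every token across all tweets and return the n most frequent."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(n)

# Made-up, already preprocessed tweets.
docs = [
    ["flight", "cancel", "delay"],
    ["thank", "flight", "great"],
    ["flight", "delay", "help"],
]
top = top_words(docs, n=2)
print(top)
```

Filtering `docs` by sentiment label before calling the helper would give the per-class lists.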

Finding common words for positive sentiment tweets

- Words like thank, flight, get, great, love, and service appear most often in the positive sentiment tweets.

Finding common words for negative sentiment tweets

- Words like flight, get, cancel, delay, service, and time appear most often in the negative sentiment tweets.

Finding common words for neutral sentiment tweets

- Words like flight, get, thank, need, please, and help appear most often in the neutral sentiment tweets.

Sentiment Model Analysis


Sentiment Analysis using Supervised Learning Methods


Building the model based on CountVectorizer and Random Forest
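A minimal sketch of a bag-of-words Random Forest pipeline in scikit-learn. The toy tweets, the labels, and the parameter values (`max_features`, `n_estimators`, `random_state`) are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy preprocessed tweets and labels, invented for illustration only.
train_texts = [
    "flight cancel no help", "delay hours bad service", "lost bag rude agent",
    "thank great crew", "love smooth flight", "great service thank",
    "need flight info", "please help change seat", "flight time tomorrow",
]
train_labels = ["negative"] * 3 + ["positive"] * 3 + ["neutral"] * 3
test_texts = ["flight delay rude", "thank love crew"]

# CountVectorizer turns each tweet into word counts; the forest classifies them.
model = make_pipeline(
    CountVectorizer(max_features=2000),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(train_texts, train_labels)
predicted = model.predict(test_texts)
print(list(predicted))
```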


The test accuracy for the basic RF classifier is 73%. The RF can be optimized further.

The optimal estimator is derived using cross-validation.
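One way to pick a setting by cross-validation is sketched below. It assumes the tuned setting is the number of trees (`n_estimators`), which this notebook text does not confirm. The candidate values, fold count, and toy data are also illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy preprocessed tweets and labels, invented for illustration only.
texts = [
    "flight cancel no help", "delay hours bad service", "lost bag rude agent",
    "thank great crew", "love smooth flight", "great service thank",
    "need flight info", "please help change seat", "flight time tomorrow",
]
labels = ["negative"] * 3 + ["positive"] * 3 + ["neutral"] * 3

# Assumed candidate values for the number of trees.
candidates = [10, 50, 100]
cv_scores = {}
for n_trees in candidates:
    pipeline = make_pipeline(
        CountVectorizer(),
        RandomForestClassifier(n_estimators=n_trees, random_state=42),
    )
    # Mean accuracy over 3 stratified folds for this candidate.
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
    cv_scores[n_trees] = scores.mean()

# The candidate with the highest mean cross-validated accuracy.
best_n = max(cv_scores, key=cv_scores.get)
print(best_n, cv_scores)
```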

The test accuracy is 76%, which is in line with the train accuracy.

Building the model based on CountVectorizer and GradientBoostingClassifier

Building the model based on CountVectorizer and DecisionTreeClassifier

Model Comparison Summary:

The confusion matrix shows that the model works reasonably well: the predictions are broadly in line with the actual sentiment values. The RF model can be tuned further for higher accuracy, and the other models can be tuned as well.

Top 40 Features Wordcloud

The words above were used as features in deriving the sentiment of each tweet.


Building the model based on Term Frequency(TF) - Inverse Document Frequency(IDF)
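A minimal sketch of what TF-IDF weighting does, using scikit-learn's `TfidfVectorizer` with default settings on made-up documents. A word that appears in every tweet gets a lower inverse document frequency than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up preprocessed tweets; "flight" occurs in all of them.
docs = ["flight cancel", "flight delay", "flight thank"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # rows are tweets, columns are terms

# Map each term to its learned IDF weight.
# With the default smoothing, idf = ln((1 + n_docs) / (1 + doc_freq)) + 1.
idf = {term: vectorizer.idf_[index] for term, index in vectorizer.vocabulary_.items()}
print(matrix.shape, idf)
```

The resulting matrix can replace the raw counts as input to the same classifiers.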


Building the model based on TfidfVectorizer and GradientBoostingClassifier

Building the model based on TfidfVectorizer and DecisionTreeClassifier

Model Comparison Summary:

- Of the 3 models, the Random Forest classifier has the best accuracy score, so we will use the Random Forest model for predicting the results.

The confusion matrix shows that the model works reasonably well: the predictions are broadly in line with the actual sentiment values. The RF model can be tuned further for higher accuracy, and the other models can be tuned as well.

Top 40 Features Wordcloud

The words above were used as features in deriving the sentiment of each tweet.


Sentiment Analysis using Unsupervised Learning Methods



Building the model using TextBlob


Comparing the Sentiment scores of the tweet between Original vs Predicted
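A minimal sketch of the comparison step, assuming a polarity score in [-1, 1] has already been computed for each tweet. The scores, the cut-off at zero, and the helper names are illustrative assumptions, not the notebook's actual values.

```python
def polarity_to_label(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score to a sentiment label using a symmetric cut-off."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

def accuracy(actual, predicted) -> float:
    """Fraction of tweets where the predicted label equals the original one."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Made-up original labels and polarity scores.
original = ["negative", "positive", "neutral", "negative"]
polarities = [-0.6, 0.8, 0.0, 0.2]

predicted = [polarity_to_label(score) for score in polarities]
score = accuracy(original, predicted)
print(predicted, score)
```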

As expected, the accuracy of this model is not very high, since TextBlob uses simple methods to perform the sentiment analysis.

Inferences:

Building the model using VADER Sentiment


Comparing the Sentiment scores of the tweet between Original vs Predicted
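A minimal sketch of tabulating original against predicted labels, which shows where a lexicon-based model tends to go wrong. Both label lists are made up for illustration.

```python
from collections import Counter

# Made-up original and predicted labels.
original = ["negative", "negative", "negative", "neutral", "positive", "positive"]
predicted = ["negative", "negative", "neutral", "neutral", "positive", "neutral"]

# Count each (original, predicted) pair; this is a confusion matrix in dict form.
pairs = Counter(zip(original, predicted))

labels = ["negative", "neutral", "positive"]
for actual in labels:
    # One row per original label, one column per predicted label.
    row = [pairs[(actual, guess)] for guess in labels]
    print(actual, row)
```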

As expected, the accuracy of the VADER model is higher than that of the TextBlob prediction, but definitely lower than that of the supervised learning classifier models. It also took longer to run the analysis.

Inferences:

Summary of the Modeling


EDA - Based on Sentiments across Airlines:

Exploratory analysis was done for the various features, both on their own and with respect to the sentiment feature. The distribution and impact of the various features across each airline were also reported.

Data PreProcessing:

As part of the preparation for sentiment analysis, the following preprocessing steps were performed. The NLTK library was used to tokenize words, remove stopwords, and lemmatize the remaining words:

- Remove HTML tags
- Replace contractions (e.g. I'm --> I am)
- Remove numbers
- Remove URLs
- Remove @mentions
- Tokenize
- Remove non-ASCII characters
- Convert to lowercase
- Remove special characters and punctuation
- Remove stopwords
- Lemmatize

- A word cloud was produced to analyse the sentiments using the processed tweet text
- The top 15 keywords were determined for each tweet sentiment

Modeling - Sentiment Analysis:

Recommendations


Based on the study of the tweets, the following are the recommendations: